CONTENTS

Chapter 10. Logging

A good maxim of war is "know your enemy," and the same advice applies to business. You need to know your customers or, on a web site, your visitors. Everything you can know about them is in the Environment variables (discussed in Chapter 16) that Apache gets from the incoming request. Apache's logging directives, which are explained in this chapter, extract whichever elements of this data you want and write them to log files.

However, this is often not very useful data in itself. For instance, you may well want to track the repeated visits of individual customers as revealed by their cookie trail. This means writing rather tricky CGI scripts to read in great slabs of log file, break them into huge, multilevel arrays, and search the arrays to track the data you want.

10.1 Logging by Script and Database

If your site uses a database manager, you could sidestep this cumbersome procedure by writing scripts on the fly to log everything you want to know about your visitors, reading data about them from the environment variables, and recording their choices as they work through the site. Depending on your needs, it can be much easier to log the data directly than to mine it out of the log files. For instance, one of the authors (PL) has a medical encyclopedia web site (www.Medic-Planet.com). Simple Perl scripts write database records to keep track of the following:

Having stored this useful information in the database manager, it is then not hard to write a script, accessed via an SSL connection (see Chapter 11), which can only be accessed by the site management to generate HTML reports with totals and statistics that illuminate marketing problems.

10.2 Apache's Logging Facilities

Apache offers a wide range of options for controlling the format of the log files. In line with current thinking, older methods (RefererLog, AgentLog, and CookieLog) have now been replaced by the config_log_module. To illustrate this, we have taken ... /site.authent and copied it to ... /site.logging so that we can play with the logs:

User webuser
Group webgroup
ServerName www.butterthlies.com

IdentityCheck	on
NameVirtualHost 192.168.123.2
<VirtualHost www.butterthlies.com>
LogFormat "customers: host %h, logname %l, user %u, time %t, request %r,
    status %s,bytes %b,"
CookieLog logs/cookies
ServerAdmin [email protected]
DocumentRoot /usr/www/APACHE3/site.logging/htdocs/customers
ServerName www.butterthlies.com
ErrorLog /usr/www/APACHE3/site.logging/logs/customers/error_log
TransferLog /usr/www/APACHE3/site.logging/logs/customers/access_log
ScriptAlias /cgi_bin /usr/www/APACHE3/cgi_bin
</VirtualHost>
<VirtualHost sales.butterthlies.com>
LogFormat "sales: agent %{httpd_user_agent}i, cookie: %{http_Cookie}i, 
    referer: %{Referer}o, host %!200h, logname %!200l, user %u, time %t,
    request %r, status %s,bytes %b,"
CookieLog logs/cookies
ServerAdmin [email protected]
DocumentRoot /usr/www/APACHE3/site.logging/htdocs/salesmen
ServerName sales.butterthlies.com
ErrorLog /usr/www/APACHE3/site.logging/logs/salesmen/error_log
TransferLog /usr/www/APACHE3/site.logging/logs/salesmen/access_log
ScriptAlias /cgi_bin /usr/www/APACHE3/cgi_bin
<Directory /usr/www/APACHE3/site.logging/htdocs/salesmen>
AuthType Basic
AuthName darkness
AuthUserFile /usr/www/APACHE3/ok_users/sales
AuthGroupFile /usr/www/APACHE3/ok_users/groups
require valid-user
</Directory>
<Directory /usr/www/APACHE3/cgi_bin>
AuthType Basic
AuthName darkness
AuthUserFile /usr/www/APACHE3/ok_users/sales
AuthGroupFile /usr/www/APACHE3/ok_users/groups
#AuthDBMUserFile /usr/www/APACHE3/ok_dbm/sales
#AuthDBMGroupFile /usr/www/APACHE3/ok_dbm/groups
require valid-user
</Directory>
</VirtualHost>

There are a number of directives.

ErrorLog  

ErrorLog filename|syslog[:facility] 
Default: ErrorLog logs/error_log
Server config, virtual host
 

The ErrorLog directive sets the name of the file to which the server will log any errors it encounters. If the filename does not begin with a slash (/), it is assumed to be relative to the server root.

figs/unix.gif

If the filename begins with a pipe (|), it is assumed to be a command to spawn a file to handle the error log.

Apache 1.3 and Above

figs/unix.gif

Using syslog instead of a filename enables logging via syslogd(8) if the system supports it. The default is to use syslog facility local7, but you can override this by using the syslog:facility syntax, where facility can be one of the names usually documented in syslog(1). Using syslog allows you to keep logs for multiple servers in a centralized location, which can be very convenient in larger installations.

Your security could be compromised if the directory where log files are stored is writable by anyone other than the user who starts the server.

TransferLog  

TransferLog [ file | "| command "]
Default: none
Server config, virtual host
 

TransferLog specifies the file in which to store the log of accesses to the site. If it is not explicitly included in the Config file, no log will be generated.

file

This is a filename relative to the server root (if it doesn't start with a slash), or an absolute path (if it does).

command

Note the format: "| command". The double quotes are needed in the Config file. command is a program to receive the agent log information on its standard input. Note that a new program is not started for a virtual host if it inherits the TransferLog from the main server. If a program is used, it runs using the permissions of the user who started httpd. This is root if the server was started by root, so be sure the program is secure. A useful Unix program to which to send is rotatelogs,[1] which can be found in the Apache support subdirectory. It closes the log periodically and starts a new one, and it's useful for long-term archiving and log processing. Traditionally, this is done by shutting Apache down, moving the logs elsewhere, and then restarting Apache, which is obviously no fun for the clients connected at the time!

AgentLog  

AgentLog file-pipe
AgentLog logs/agent_log
Server config, virtual host
Not in Apache v2
 

The AgentLog directive sets the name of the file to which the server will log the User-Agent header of incoming requests. file-pipe is one of the following:

A filename 
A filename relative to the ServerRoot. 
"| <command>" 

This is a program to receive the agent log information on its standard input. Note that a new program will not be started for a VirtualHost if it inherits the AgentLog from the main server.

If a program is used, then it will be run under the user who started httpd. This will be root if the server was started by root; be sure that the program is secure.

Also, see the Apache security tips document discussed in Chapter 11 for details on why your security could be compromised if the directory where log files are stored is writable by anyone other than the user that starts the server.

This directive is provided for compatibility with NCSA 1.4.

LogLevel  

LogLevel level 
Default: error
Server config, virtual host
 

LogLevel controls the amount of information recorded in the error_log file. The levels are as follows:

emerg

The system is unusable — exiting. For example:

"Child cannot open lock file. Exiting" 
alert

Immediate action is necessary. For example:

"getpwuid: couldn't determine user name from uid" 
crit

Critical condition. For example:

"socket: Failed to get a socket, exiting child" 
error

Client is not getting a proper service. For example:

"Premature end of script headers"
warn

Nonthreatening problems, which may need attention. For example:

"child process 1234 did not exit, sending another SIGHUP"
notice

Normal events, which may need to be evaluated. For example:

"httpd: caught SIGBUS, attempting to dump core in ..."
info

For example:

"Server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers)..." 
debug

Logs normal events for debugging purposes.

Each level will report errors that would have been printed by higher levels. Use debug for development, then switch to, say, crit for production. Remember that if each visitor on a busy site generates one line in the error_log, the hard disk will soon fill up and stop the system.

LogFormat  

LogFormat format_string [nickname]
Default: "%h %l %u %t \"%r\" %s %b"
Server config, virtual host
 

LogFormat sets the information to be included in the log file and the way in which it is written. The default format is the Common Log Format (CLF), which is expected by off-the-shelf log analyzers such as wusage (http://www.boutell.com/) or ANALOG, so if you want to use one of them, leave this directive alone.[2] The CLF format is as follows:

host ident authuser date request status bytes
host

Hostname of the client or its IP number.

ident

If IdentityCheck is enabled and the client machine runs identd, the identity information reported by the client. (This can cause performance issues as the server makes identd requests that may or may not be answered.)

authuser

If the request was for a password-protected document, is the user ID.

date

The date and time of the request, in the following format:

[day/month/year:hour:minute:second  tzoffset].
request

Request line from client, in double quotes.

status

Three-digit status code returned to the client.

bytes

The number of bytes returned, excluding headers.

The log format can be customized using a format_string. The commands in it have the format %[condition]key_letter ; the condition need not be present. If it is and the specified condition is not met, the output will be a -. The key_letter s are as follows:

%...a: Remote IP-address 
%...A: Local IP-address 
%...B: Bytes sent, excluding HTTP headers. 
%...b: Bytes sent, excluding HTTP headers. In CLF format i.e. a '-' rather than a 0 
when no bytes are sent. 
%...{Foobar}C: The contents of cookie "Foobar" in the request sent to the server. 
%...D: The time taken to serve the request, in microseconds. 
%...{FOOBAR}e: The contents of the environment variable FOOBAR 
%...f: Filename 
%...h: Remote host 
%...H The request protocol 
%...{Foobar}i: The contents of Foobar: header line(s) in the request sent to the 
server. 
%...l: Remote logname (from identd, if supplied) 
%...m The request method 
%...{Foobar}n: The contents of note "Foobar" from another module. 
%...{Foobar}o: The contents of Foobar: header line(s) in the reply. 
%...p: The canonical Port of the server serving the request 
%...P: The process ID of the child that serviced the request. 
%...q The query string (prepended with a ? if a query string exists, otherwise an 
empty string) %...r: First line of request 
%...s: Status. For requests that got internally redirected, this is the status of the 
*original* request --- 
%...>s for the last. 
%...t: Time, in common log format time format (standard english format) %...
{format}t: The time, in the form given by format, which should be in strftime(3) 
format. (potentially localized) 
%...T: The time taken to serve the request, in seconds. 
%...u: Remote user (from auth; may be bogus if return status (%s) is 401) 
%...U: The URL path requested, not including any query string. 
%...v: The canonical ServerName of the server serving the request. 
%...V: The server name according to the UseCanonicalName setting. 
%...X: Connection status when response is completed. 'X' = connection aborted before 
the response completed. '+' = connection may be kept alive after the response is 
sent. '-' = connection will be closed after the response is sent. (This directive was 
%...c in late versions of Apache 1.3, but this conflicted with the historical ssl %...{var}c syntax.) 

The format string can contain ordinary text of your choice in addition to the % directives.

CustomLog  

CustomLog file|pipe format|nickname
Server config, virtual host
 

The first argument is the filename to which log records should be written. This is used exactly like the argument to TransferLog; that is, it is either a full path, relative to the current server root, or a pipe to a program.

The format argument specifies a format for each line of the log file. The options available for the format are exactly the same as those for the argument of the LogFormat directive. If the format includes any spaces (which it will in almost all cases), it should be enclosed in double quotes.

Instead of an actual format string, you can use a format nickname defined with the LogFormat directive.

10.2.1 site.authent — Another Example

site.authent is set up with two virtual hosts, one for customers and one for salespeople, and each has its own logs in ... /logs/customers and ... /logs/salesmen. We can follow that scheme and apply one LogFormat to both, or each can have its own logs with its own LogFormats inside the <VirtualHost> directives. They can also have common log files, set up by moving ErrorLog and TransferLog outside the <VirtualHost> sections, with different LogFormats within the sections to distinguish the entries. In this last case, the LogFormat files could look like this:

<VirtualHost www.butterthlies.com>
LogFormat "Customer:..."
...
</VirtualHost>

<VirtualHost sales.butterthlies.com>
LogFormat "Sales:..."
...
</VirtualHost>

Let's experiment with a format for customers, leaving everything else the same:

<VirtualHost www.butterthlies.com>
LogFormat "customers: host %h, logname %l, user %u, time %t, request %r
    status %s, bytes %b,"
...

We have inserted the words host, logname, and so on to make it clear in the file what is doing what. In real life you probably wouldn't want to clutter the file up in this way because you would look at it regularly and remember what was what or, more likely, process the logs with a program that would know the format. Logging on to www.butterthlies.com and going to summer catalog produces this log file:

customers: host 192.168.123.1, logname unknown, user -, time [07/Nov/
    1996:14:28:46 +0000], request GET / HTTP/1.0, status 200,bytes -
customers: host 192.168.123.1, logname unknown, user -, time [07/Nov/
    1996:14:28:49 +0000], request GET /hen.jpg HTTP/1.0, status 200,
    bytes 12291,
customers: host 192.168.123.1, logname unknown, user -, time [07/Nov
    /1996:14:29:04 +0000], request GET /tree.jpg HTTP/1.0, status 200,
    bytes 11532,
customers: host 192.168.123.1, logname unknown, user -, time [07/Nov/
    1996:14:29:19 +0000], request GET /bath.jpg HTTP/1.0, status 200,
    bytes 5880,

This is not too difficult to follow. Notice that while we have logname unknown, the user is -, the usual report for an unknown value. This is because customers do not have to give an ID; the same log for salespeople, who do, would have a value here.

We can improve things by inserting lists of conditions based on the error codes after the % and before the command letter. The error codes are defined in the HTTP 1.0 specification:

200 OK
302 Found
304 Not Modified
400 Bad Request
401 Unauthorized
403 Forbidden
404 Not found
500 Server error
503 Out of resources
501 Not Implemented
502 Bad Gateway

The list from HTTP 1.1 is as follows:

100  Continue
101  Switching Protocols
200  OK
201  Created
202  Accepted
203  Non-Authoritative Information
204  No Content
205  Reset Content 
206  Partial Content
300  Multiple Choices
301  Moved Permanently
302  Moved Temporarily
303  See Other
304  Not Modified
305  Use Proxy
400  Bad Request
401  Unauthorized
402  Payment Required
403  Forbidden
404  Not Found
405  Method Not Allowed
406  Not Acceptable
407  Proxy Authentication Required
408  Request Time-out
409  Conflict
410  Gone
411  Length Required
412  Precondition Failed
413  Request Entity Too Large
414  Request-URI Too Large
415  Unsupported Media Type
500  Internal Server Error
501  Not Implemented
502  Bad Gateway
503  Service Unavailable
504  Gateway Time-out
505  HTTP Version not supported

You can use ! before a code to mean "if not." !200 means "log this if the response was not OK." Let's put this in salesmen:

<VirtualHost sales.butterthlies.com>
LogFormat "sales: host %!200h, logname %!200l, user %u, time %t, request %r,
    status %s,bytes %b,"
...

An attempt to log in as fred with the password don't know produces the following entry:

sales: host 192.168.123.1, logname unknown, user fred, time [19/Aug/
    1996:07:58:04 +0000], request GET HTTP/1.0, status 401, bytes -

However, if it had been the infamous bill with the password theft, we would see:

host -, logname -, user bill, ...

because we asked for host and logname to be logged only if the request was not OK. We can combine more than one condition, so that if we only want to know about security problems on sales, we could log usernames only if they failed to authenticate:

LogFormat "sales: bad user: %400,401,403u"

We can also extract data from the HTTP headers in both directions:

%[condition]{user-agent}i

This prints the user agent (i.e., the software the client is running) if condition is met. The old way of doing this was AgentLog logfile and ReferLog logfile.

10.3 Configuration Logging

Apache is able to report to a client a great deal of what is happening to it internally. The necessary module is contained in the mod_info.c file, which should be included at build time. It provides a comprehensive overview of the server configuration, including all installed modules and directives in the configuration files. This module is not compiled into the server by default. To enable it, either load the corresponding module if you are running Win32 or Unix with DSO support enabled, or add the following line to the server build Config file and rebuild the server:

AddModule modules/standard/mod_info.o

It should also be noted that if mod_info is compiled into the server, its handler capability is available in all configuration files, including per-directory files (e.g., .htaccess). This may have security-related ramifications for your site. To demonstrate how this facility can be applied to any site, the Config file on .../site.info is the .../site.authent file slightly modified:

User webuser
Group webgroup
ServerName www.butterthlies.com
	
NameVirtualHost 192.168.123.2

LogLevel debug

<VirtualHost www.butterthlies.com>
#CookieLog logs/cookies
AddModuleInfo mod_setenvif.c "This is what I've added to mod_setenvif"
ServerAdmin [email protected]
DocumentRoot /usr/www/APACHE3/site.info/htdocs/customers
ServerName www.butterthlies.com
ErrorLog /usr/www/APACHE3/site.info/logs/error_log
TransferLog /usr/www/APACHE3/site.info/logs/customers/access_log
ScriptAlias /cgi-bin /usr/www/APACHE3/cgi-bin

<Location /server-info>
SetHandler server-info
</Location>

</VirtualHost>

<VirtualHost sales.butterthlies.com>
CookieLog logs/cookies
ServerAdmin [email protected]
DocumentRoot /usr/www/APACHE3/site.info/htdocs/salesmen
ServerName sales.butterthlies.com
ErrorLog /usr/www/APACHE3/site.info/logs/error_log
TransferLog /usr/www/APACHE3/site.info/logs/salesmen/access_log
ScriptAlias /cgi-bin /usr/www/APACHE3/cgi-bin
<Directory /usr/www/APACHE3/site.info/htdocs/salesmen>
AuthType Basic
#AuthType Digest
AuthName darkness

AuthUserFile /usr/www/APACHE3/ok_users/sales
AuthGroupFile /usr/www/APACHE3/ok_users/groups

#AuthDBMUserFile /usr/www/APACHE3/ok_dbm/sales
#AuthDBMGroupFile /usr/www/APACHE3/ok_dbm/groups

#AuthDigestFile /usr/www/APACHE3/ok_digest/sales
require valid-user
satisfy any
order deny,allow
allow from 192.168.123.1
deny from all
#require user daphne bill
#require group cleaners
#require group directors
</Directory>

<Directory /usr/www/APACHE3/cgi-bin>
AuthType Basic
AuthName darkness
AuthUserFile /usr/www/APACHE3/ok_users/sales
AuthGroupFile /usr/www/APACHE3/ok_users/groups
#AuthDBMUserFile /usr/www/APACHE3/ok_dbm/sales
#AuthDBMGroupFile /usr/www/APACHE3/ok_dbm/groups
require valid-user
</Directory>

</VirtualHost>

Note the AddModuleInfo line and the <Location ...> block.

10.3.1 AddModuleInfo

The AddModule directive allows the content of string to be shown as HTML-interpreted additional information for the module module-name.

AddModuleInfo module-name string
Server config, virtual host

For example:

AddModuleInfo mod_auth.c 'See <A HREF="http://www.apache.org/docs/mod/
    mod auth.html">http://www.apache.org/docs/mod/mod_auth.html</A>'

To invoke the module, browse to www.butterthlies.com/server-info,and you will see something like the following:

Apache Server Information
Server Settings, mod_setenvif.c, mod_usertrack.c, mod_auth_digest.c, mod_auth_db.c, 
mod_auth_anon.c, mod_auth.c, mod_access.c, mod_rewrite.c, mod_alias.c, mod_userdir.c, 
mod_actions.c, mod_imap.c, mod_asis.c, mod_cgi.c, mod_dir.c, mod_autoindex.c, mod_
include.c, mod_info.c, mod_status.c, mod_negotiation.c, mod_mime.c, mod_log_config.c, 
mod_env.c, http_core.c 
Server Version: Apache/1.3.14 (Unix)
Server Built: Feb 13 2001 15:20:23
API Version: 19990320:10
Run Mode: standalone
User/Group: webuser(1000)/1003
Hostname/port: www.butterthlies.com:0
Daemons: start: 5 min idle: 5 max idle: 10 max: 256
Max Requests: per child: 0 keep alive: on max per connection: 100
Threads: per child: 0 
Excess requests: per child: 0 
Timeouts: connection: 300 keep-alive: 15
Server Root: /usr/www/APACHE3/site.info
Config File: /usr/www/APACHE3/site.info/conf/httpd.conf
PID File: logs/httpd.pid
Scoreboard File: logs/apache_runtime_status

Module Name: mod_setenvif.c 
Content handlers: none 
Configuration Phase Participation: Create Directory Config, Merge Directory Configs, 
Create Server Config, Merge Server Configs 
Request Phase Participation: Post-Read Request, Header Parse 
Module Directives: 
SetEnvIf - A header-name, regex and a list of variables. 
SetEnvIfNoCase - a header-name, regex and a list of variables. 
BrowserMatch - A browser regex and a list of variables. 
BrowserMatchNoCase - A browser regex and a list of variables. 
Current Configuration: 
Additional Information: 
This is what I've added to mod_setenvif
............

The file carries on to document all the compiled-in modules.

10.4 Status

In a similar way, Apache can be persuaded to cough up comprehensive diagnostic information by including and invoking the module mod_status:

AddModule modules/standard/mod_status.o

This produces invaluable information for the webmaster of a busy site, enabling her to track down problems before they become disasters. However, since this is really our own business, we don't want the unwashed mob out on the Web jostling to see our secrets. To protect the information, we therefore restrict it to a whole or partial IP address that describes our own network and no one else's.

10.4.1 Server Status

For this exercise, which includes info as previously, the httpd.conf in ... /site.status file should look like this:

User webuser
Group webgroup
ServerName www.butterthlies.com
DocumentRoot /usr/www/APACHE3/site.status/htdocs
ExtendedStatus on

<Location /status>
order deny,allow
allow from 192.168.123.1
deny from all
SetHandler server-status
</Location>

<Location /info>
order deny,allow
allow from 192.168.123.1
deny from all
SetHandler server-status
SetHandler server-info
</Location>

The allow from directive keeps our laundry private.

Remember the way order works: the last entry has the last word. Notice also the use of SetHandler , which sets a handler for all requests to a directory, instead of AddHandler, which specifies a handler for particular file extensions. If you then access www.butterthlies.com/status, you get this response:

Apache Server Status for www.butterthlies.com
Server Version: Apache/1.3.14 (Unix)
Server Built: Feb 13 2001 15:20:23

Current Time: Tuesday, 13-Feb-2001 16:03:30 GMT
Restart Time: Tuesday, 13-Feb-2001 16:01:49 GMT
Parent Server Generation: 0 
Server uptime: 1 minute 41 seconds
Total accesses: 21 - Total Traffic: 49 kB
CPU Usage: u.0703125 s.015625 cu0 cs0 - .0851% CPU load
.208 requests/sec - 496 B/second - 2389 B/request
1 requests currently being processed, 5 idle servers 
_W___  _..........................................................
................................................................
................................................................
................................................................
Scoreboard Key: 
"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
"L" Logging, "G" Gracefully finishing, "." Open slot with no current process 

Srv PID  Acc      M CPU  SS Req Conn Child Slot Client        VHost 
Request 
0-0 2434 0/1/1    _ 0.01 93   5  0.0  0.00 0.00 192.168.123.1 www.butterthlies.com 
GET /status HTTP/1.1
1-0 2435 20/20/20 W 0.08  1   0 47.1  0.05 0.05 192.168.123.1 www.butterthlies.com 
GET /status?refresh=2 HTTP/1.1

Srv   Child Server number - generation 
PID   OS process ID 
Acc   Number of accesses this connection / this child / this slot 
M     Mode of operation 
CPU   CPU usage, number of seconds 
SS    Seconds since beginning of most recent request 
Req   Milliseconds required to process most recent request 
Conn  Kilobytes transferred this connection 
Child Megabytes transferred this child 
Slot  Total megabytes transferred this slot  

There are several useful variants on the basic status request made from the browser:

status?notable

Returns the status without using tables, for browsers with no table support

status?refresh

Updates the page once a second

status?refresh=<n>

Updates the page every <n> seconds

status?auto

Returns the status in a format suitable for processing by a program

These can also be combined by putting a comma between them, i.e., http://www.butterthlies.com/status?notable,refresh=10.

10.4.2 ExtendedStatus

The ExtendedStatus directive controls whether the server keeps track of extended status information for each request.

ExtendedStatus On|Off
Default: Off
server config 

This is only useful if the status module is enabled on the server.

This setting applies to the entire server and cannot be enabled or disabled on a VirtualHost-by-VirtualHost basis. It can adversely affect performance.

[1]  Written by one of the authors of this book (BL).

[2]  Actually, some log analyzers support some extra information in the log file, but you need to read the analyzer's documentation for details.

CONTENTS